Intro

Sometimes you know nothing about a new dataset. The pointblank package is here to help and it has the scan_data() function. So simple, and it gives you so much information on a data table. The function generates an HTML report that scours the input table data.

In the same spirit, generating validation steps can be laborious and difficult at first. There’s a function available to kickstart that process: draft_validation(). It’ll generate a new .R file with a suggested validation plan that’s meant to work and is tweakable.

Performing table scans with scan_data()

The scan_data() function is available for providing an interactive overview of a tabular dataset. The reporting output contains several sections to make everything more digestible, and these are:

  • Overview: Shows table dimensions, duplicate row counts, column types, and reproducibility information
  • Variables: Provides a summary for each table variable and further statistics and summaries depending on the variable type
  • Interactions: Displays a matrix plot that describes the interactions between variables
  • Correlations: This is a set of correlation matrix plots for numerical variables
  • Missing Values: A summary figure that shows the degree of missingness across variables
  • Sample: A table that provides the head and tail rows of the dataset

The output HTML report will appear in the RStudio Viewer and can also be integrated in R Markdown or Quarto HTML output. Here’s an example that uses the penguins_raw dataset from the palmerpenguins package.

scan_data(tbl = palmerpenguins::penguins_raw, navbar = FALSE)

Overview of palmerpenguins::penguins_raw

Table Overview

Columns

17

Rows

344

NAs

336 (5.75%)

Duplicate Rows

0

Column Types

character 9
numeric 7
Date 1

Reproducibility Information

Scan Build Time

2022-10-13 00:06:10

pointblank Version

0.11.1.9000

R Version

R version 4.1.1 (2021–08–10)
Kick Things

Operating System

x86_64-apple-darwin17.0

Variables

Distinct

3

NAs

0

Inf/-Inf

0

Distinct

152

NAs

0

Inf/-Inf

0

Mean

63.15

Minimum

1

Maximum

152

Distinct

3

NAs

0

Inf/-Inf

0

Distinct

1

NAs

0

Inf/-Inf

0

Distinct

3

NAs

0

Inf/-Inf

0

Distinct

1

NAs

0

Inf/-Inf

0

Distinct

190

NAs

0

Inf/-Inf

0

Distinct

2

NAs

0

Inf/-Inf

0

Distinct

50

NAs

0

Inf/-Inf

0

Distinct

165

NAs

2

Inf/-Inf

0

Mean

43.92

Minimum

32.1

Maximum

59.6

Distinct

81

NAs

2

Inf/-Inf

0

Mean

17.15

Minimum

13.1

Maximum

21.5

Distinct

56

NAs

2

Inf/-Inf

0

Mean

200.92

Minimum

172

Maximum

231

Distinct

95

NAs

2

Inf/-Inf

0

Mean

4,201.75

Minimum

2,700

Maximum

6,300

Distinct

3

NAs

11

Inf/-Inf

0

Distinct

331

NAs

14

Inf/-Inf

0

Mean

8.73

Minimum

7.63

Maximum

10.03

Distinct

332

NAs

13

Inf/-Inf

0

Mean

−25.69

Minimum

−27.02

Maximum

−23.79

Distinct

11

NAs

290

Inf/-Inf

0

Interactions

Correlations

Missing Values

Sample

studyName Sample Number Species Region Island Stage Individual ID Clutch Completion Date Egg Culmen Length (mm) Culmen Depth (mm) Flipper Length (mm) Body Mass (g) Sex Delta 15 N (o/oo) Delta 13 C (o/oo) Comments
1 PAL0708 1 Adelie Penguin (Pygoscelis adeliae) Anvers Torgersen Adult, 1 Egg Stage N1A1 Yes 2007-11-11 39.1 18.7 181 3750 MALE NA NA Not enough blood for isotopes.
2 PAL0708 2 Adelie Penguin (Pygoscelis adeliae) Anvers Torgersen Adult, 1 Egg Stage N1A2 Yes 2007-11-11 39.5 17.4 186 3800 FEMALE 8.94956 -24.69454 NA
3 PAL0708 3 Adelie Penguin (Pygoscelis adeliae) Anvers Torgersen Adult, 1 Egg Stage N2A1 Yes 2007-11-16 40.3 18.0 195 3250 FEMALE 8.36821 -25.33302 NA
4 PAL0708 4 Adelie Penguin (Pygoscelis adeliae) Anvers Torgersen Adult, 1 Egg Stage N2A2 Yes 2007-11-16 NA NA NA NA NA NA NA Adult not sampled.
5 PAL0708 5 Adelie Penguin (Pygoscelis adeliae) Anvers Torgersen Adult, 1 Egg Stage N3A1 Yes 2007-11-16 36.7 19.3 193 3450 FEMALE 8.76651 -25.32426 NA
6..339
340 PAL0910 64 Chinstrap penguin (Pygoscelis antarctica) Anvers Dream Adult, 1 Egg Stage N98A2 Yes 2009-11-19 55.8 19.8 207 4000 MALE 9.70465 -24.53494 NA
341 PAL0910 65 Chinstrap penguin (Pygoscelis antarctica) Anvers Dream Adult, 1 Egg Stage N99A1 No 2009-11-21 43.5 18.1 202 3400 FEMALE 9.37608 -24.40753 Nest never observed with full clutch.
342 PAL0910 66 Chinstrap penguin (Pygoscelis antarctica) Anvers Dream Adult, 1 Egg Stage N99A2 No 2009-11-21 49.6 18.2 193 3775 MALE 9.46180 -24.70615 Nest never observed with full clutch.
343 PAL0910 67 Chinstrap penguin (Pygoscelis antarctica) Anvers Dream Adult, 1 Egg Stage N100A1 Yes 2009-11-21 50.8 19.0 210 4100 MALE 9.98044 -24.68741 NA
344 PAL0910 68 Chinstrap penguin (Pygoscelis antarctica) Anvers Dream Adult, 1 Egg Stage N100A2 Yes 2009-11-21 50.2 18.7 198 3775 FEMALE 9.39305 -24.25255 NA

As could be seen, the first two sections had a lot of additional information tucked behind detail views (with the Toggle details buttons) and within tab sets. Should this amount of information be a little overwhelming, there is the option to disable one or more sections. With scan_data()’s sections argument, you can specify just the sections that are needed for a specific scan.

The default value for sections is the string "OVICMS" and each letter of that stands for the following sections in their default order:

"O": "overview" "V": "variables" "I": "interactions" "C": "correlations" "M": "missing" "S": "sample"

This string can contain less key characters and the order can be changed to suit the desired layout of the report. For example, if you just need the Overview, a Sample, and the description of Variables in the target table, the string to use for sections would be "OSV".

The tbl supplied could be a data frame, tibble, a tbl_dbi object, or a tbl_spark object. Here are a few more datasets that could be scanned, this time using sections = "OSV":

scan_data(tbl = safetyData::adam_adae, sections = "OSV")
scan_data(tbl = safetyData::adam_advs, sections = "OSV")

The reporting generated by scan_data() can be presented in one of eight spoken languages: English ("en", the default), French ("fr"), German ("de"), Italian ("it"), Spanish ("es"), Portuguese ("pt"), Turkish ("tr"), Chinese ("zh"), Russian ("ru"), Polish ("pl"), Danish ("da"), Swedish ("sv"), and Dutch ("nl"). These two-letter language codes can be used as an argument to the lang argument.

Here’s an example that scans dplyr’s starwars dataset and creates the report in Danish.

scan_data(tbl = dplyr::starwars, sections = "OVS", lang = "da")

Overblik over dplyr::starwars

Tabel overblik

Kolonner

14

Rækker

87

NA-værdier

105 (8,62%)

Duplikerede rækker

0

Kolonnetyper

character 8
list 3
numeric 2
integer 1

Information om replikerbarhed

Tidspunkt for scan

2022-10-13 00:06:33

pointblank Version

0.11.1.9000

R Version

R version 4.1.1 (2021–08–10)
Kick Things

Styresystem

x86_64-apple-darwin17.0

Variable

Unikke værdier

87

NA-værdier

0

Inf/-Inf

0

Unikke værdier

46

NA-værdier

6

Inf/-Inf

0

Middelværdi

174,36

Minimum

66

Maksimum

264

Unikke værdier

39

NA-værdier

28

Inf/-Inf

0

Middelværdi

97,31

Minimum

15

Maksimum

1.358

Unikke værdier

13

NA-værdier

5

Inf/-Inf

0

Unikke værdier

31

NA-værdier

0

Inf/-Inf

0

Unikke værdier

15

NA-værdier

0

Inf/-Inf

0

Unikke værdier

37

NA-værdier

44

Inf/-Inf

0

Middelværdi

87,57

Minimum

8

Maksimum

896

Unikke værdier

5

NA-værdier

4

Inf/-Inf

0

Unikke værdier

3

NA-værdier

4

Inf/-Inf

0

Unikke værdier

49

NA-værdier

10

Inf/-Inf

0

Unikke værdier

38

NA-værdier

4

Inf/-Inf

0

Stikprøve

name height mass hair_color skin_color eye_color birth_year sex gender homeworld species films vehicles starships
1 Luke Skywalker 172 77 blond fair blue 19.0 male masculine Tatooine Human The Empire Strikes Back, Revenge of the Sith, Return of the Jedi, A New Hope, The Force Awakens Snowspeeder, Imperial Speeder Bike X-wing, Imperial shuttle
2 C-3PO 167 75 NA gold yellow 112.0 none masculine Tatooine Droid The Empire Strikes Back, Attack of the Clones, The Phantom Menace, Revenge of the Sith, Return of the Jedi, A New Hope
3 R2-D2 96 32 NA white, blue red 33.0 none masculine Naboo Droid The Empire Strikes Back, Attack of the Clones, The Phantom Menace, Revenge of the Sith, Return of the Jedi, A New Hope, The Force Awakens
4 Darth Vader 202 136 none white yellow 41.9 male masculine Tatooine Human The Empire Strikes Back, Revenge of the Sith, Return of the Jedi, A New Hope TIE Advanced x1
5 Leia Organa 150 49 brown light brown 19.0 female feminine Alderaan Human The Empire Strikes Back, Revenge of the Sith, Return of the Jedi, A New Hope, The Force Awakens Imperial Speeder Bike
6..82
83 Rey NA NA brown light hazel NA female feminine NA Human The Force Awakens
84 Poe Dameron NA NA brown light brown NA male masculine NA Human The Force Awakens T-70 X-wing fighter
85 BB8 NA NA none none black NA none masculine NA Droid The Force Awakens
86 Captain Phasma NA NA unknown unknown unknown NA NA NA NA NA The Force Awakens
87 Padmé Amidala 165 45 brown light brown 46.0 female feminine Naboo Human Attack of the Clones, The Phantom Menace, Revenge of the Sith H-type Nubian yacht, Naboo star skiff, Naboo fighter

It’s possible to export this reporting to a self-contained HTML file. To do so, use the export_report() function (this also works for every other type of reporting you’ll see in the Viewer).

# Use `scan_data()` and assign reporting to `tbl_scan`
tbl_scan <- scan_data(tbl = dplyr::storms, sections = "OVS")

# Write the `ptblank_tbl_scan` object to an HTML file
export_report(
  tbl_scan,
  filename = "tbl_scan-storms.html"
)
## ✔ The table scan has been written as `tbl_scan-storms.html`

Drafting a nice, new validation plan with draft_validation()

We can generate a draft validation plan in a new .R or .Rmd file using an input data table (just like with scan_data()). Using this workflow, the data table will be scanned to learn about its column data and a set of starter validation steps (constituting a validation plan) will be written.

Let’s draft a validation plan for the dplyr::storms dataset. Here’s a quick look at that table:

dplyr::storms
## # A tibble: 11,859 × 13
##    name   year month   day  hour   lat  long status        categ…¹  wind press…²
##    <chr> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <chr>         <ord>   <int>   <int>
##  1 Amy    1975     6    27     0  27.5 -79   tropical dep… -1         25    1013
##  2 Amy    1975     6    27     6  28.5 -79   tropical dep… -1         25    1013
##  3 Amy    1975     6    27    12  29.5 -79   tropical dep… -1         25    1013
##  4 Amy    1975     6    27    18  30.5 -79   tropical dep… -1         25    1013
##  5 Amy    1975     6    28     0  31.5 -78.8 tropical dep… -1         25    1012
##  6 Amy    1975     6    28     6  32.4 -78.7 tropical dep… -1         25    1012
##  7 Amy    1975     6    28    12  33.3 -78   tropical dep… -1         25    1011
##  8 Amy    1975     6    28    18  34   -77   tropical dep… -1         30    1006
##  9 Amy    1975     6    29     0  34.4 -75.8 tropical sto… 0          35    1004
## 10 Amy    1975     6    29     6  34   -74.8 tropical sto… 0          40    1002
## # … with 11,849 more rows, 2 more variables:
## #   tropicalstorm_force_diameter <int>, hurricane_force_diameter <int>, and
## #   abbreviated variable names ¹​category, ²​pressure

Here’s how we generate the new .R file:

draft_validation(
  tbl = ~dplyr::storms, # This `~` makes it an expression for getting the data
  tbl_name = "storms", 
  file_name = "storms-validation"
)

Check out the new file called "storms-validation.R"! It’s ready to run, all the validation steps run without failing test units, and the process (thanks to column inference routines) knows what to do with certain types of columns (like the latitude and longitude ones).

Once in the file, it’s possible to tweak the validation steps to better fit the expectations to the particular domain. It’s best to use a data extract that contains a good amount of rows and is relatively free of spurious data.


SUMMARY

  1. It’s a great idea to examine data you’re unfamiliar with with scan_data()!
  2. The draft_validation() function can give you a super-quickstart for data validation (it scans your data, but in a different way).